Abstract: Place classification is a fundamental ability that
a robot should possess to carry out effective human-robot
interactions. It is a nontrivial classification problem which
has attracted much research. In recent years, Artificial
Intelligence algorithms have been widely exploited in robotics
applications. Inspired by the recent successes of deep learning
methods, we propose an end-to-end learning approach for the 
place classification problem. With the deep architectures, this 
methodology automatically discovers features and contributes 
in general to higher classification accuracies. The pipeline of 
our approach is composed of three parts. Firstly, we construct 
multiple layers of laser range data to represent the environment 
information in different levels of granularity. Secondly, each 
layer of data is fed into a deep neural network model for 
classification, where a graph regularization is imposed on
the deep architecture to keep local consistency between
adjacent samples. Finally, the predicted labels obtained from all
the layers are fused based on confidence trees to maximize the
overall confidence. Experimental results validate the effectiveness
of our end-to-end place classification framework, in which
both the multi-layer structure and the graph regularization 
improve the classification performance. Furthermore, results
show that the features automatically learned from the raw
input range data achieve results competitive with those of
features constructed from statistical and geometrical information.

I. Introduction 

Place classification is one of the important problems in
human-robot interaction and mobile robotics. It aims to
distinguish between environmental locations and to assign a
label (corridor, office, kitchen, etc.) to each location [33], [17].
It allows robots to achieve spatial awareness through semantic
understanding rather than having to rely on precise coordinates
when communicating with humans. Furthermore, the semantic
labels have the potential to facilitate other robotic functions
such as mapping [20], behavior-based navigation [16], task
planning [7] and active object search and rescue [1].

In general, place classification is carried out through
environment sensing. Laser range finders, cameras and RGB-D
sensors are the most commonly used sensing modalities. Location
and topological information can also be informative in place
classification. In this work, we attempt to exploit both
the sensory data and the location information. We assume all the
maps in this paper contain these two kinds of information and
that some of the maps are labeled with human knowledge. Then
the place classification problem can be stated as predicting
the labels of new environments given the labeled maps.

By analysing those two forms of data, sensory data and
location information, we can gain insights into the
characteristics of the place classification problem. Raw sensory
data encode the environment information at different locations,
which can provide discriminative information between
different classes. However, this requires an effective feature
extraction method, and most of the previous works tend to
extract hand-engineered features from the raw data [15], [27].
Our opinion is that hand-engineered features may not fully
exploit the potential to achieve higher generalization ability. On
the other hand, the locations encode the spatial information
of the environment and indicate the local consistency of the
labels, which means that positions in spatial proximity have
a higher probability of having the same class labels.

It is to be noted that another difficulty in place
classification is the influence of the different fields of view (FOV)
of the sensors used. For example, if a laser range finder
collects 180° FOV data facing approximately towards a corner
of a corridor, the data may not contain enough information for
classification. If the laser range finder collects 360° FOV
data at the door of an office room, the robot might be confused
by mixed information from two classes.

In order to address these problems, in this paper we
propose a graph-regularized deep learning approach to
classification on multi-layer inputs. The pipeline of our system is
illustrated in Fig. 1, which can be split into three parts:

1) Construction of multi-layer inputs: The environmental
information in this paper is represented through the
generalized Voronoi graph (GVG) [4], a topological graph in
which the nodes correspond to the sensory data and the
edges denote the relationships. By fusing the information
and eliminating the end-nodes, we implement a recursive
algorithm to construct multi-layer inputs with hierarchical
GVGs. The inputs of higher layers contain information from a
larger field of view, represented by an increasingly succinct
GVG. The features are extracted from each layer of input
and classified independently.

2) The graph regularized deep architecture for feature
learning and classification: We adopt a deep architecture
that learns features from the raw data automatically. A graph
regularizer is imposed on the deep architecture to keep the
local consistency, where an adjacency graph is constructed
to depict the adjacency and similarity between the samples.
Our training map and testing maps are fed into the deep
architecture for feature learning at the same time, which
forms a semi-supervised learning framework. The output of
this step is the predicted labels of the different layers.


[Figure 1. Blocks: Construction of multi-layer inputs; Feature
learning and classification; Decision making. Connecting labels:
Topological graph, Features, Labels.]

Fig. 1. Pipeline of the semi-supervised learning system with multi-layer inputs.


3) The confidence tree for decision making: After receiving
the classification results of the multi-layer inputs, confidence
trees are constructed according to the topological graph, and
a decision making process is carried out to maximize the
overall confidence.

The remainder of this paper is organized as follows:
Section II reviews the related literature. In Section III
we introduce the construction of our multi-layer inputs
and the confidence tree for decision making. The
semi-supervised classification with graph regularization is given in
Section IV. Experimental results are presented in Section V
to validate the effectiveness of our end-to-end classification
framework. Then the paper is concluded in Section VI.


II. Related Works 

There are various sensors that help robots to sense
environments, such as cameras and laser range finders. Previous
works have demonstrated the effectiveness of both camera
data and laser range finder data for classifying places. For
example, Shi et al. [26] and Viswanathan et al. [31] extracted
features from vision data, while Mozos et al. [15] and
Sousa et al. [27] classified places based on laser range
data. In this paper, we focus on place classification based
on laser range data; however, our approach can be easily
extended to other sensing modalities such as vision.

Laser range finders provide nonnegative beam sequences
describing range and bearing to nearby objects.
They contain structural information including clutter in the
environment. Mozos et al. [15] extracted features from
360° laser range data and fed those features into an
Adaboost classifier to label the environment. Sousa et al. [27]
reported superior results on a binary classification task using
a subset of the above mentioned features and a support vector
machine as the classifier. In our past work, we implemented
a logistic regression based classifier for both the binary and
the multi-class problem, contributing to higher accuracies [24], [25].
The work was further extended to address the generalizability
of the solution through semi-supervised place classification
over a generalized Voronoi graph (SPCoGVG) [23]. In all of
these methods, the features were extracted from the laser
range data based on statistical and geometrical information,
i.e. they are so-called hand-engineered features. For instance, the
average and the standard deviation of the beam length, and the area
and perimeter of the polygon specified by the observed range
data and bearing, were included in the feature set.

Recently, unsupervised feature learning has drawn
much attention as deep learning methods were
developed [12], [11], [2]. Deep learning methods have achieved
remarkable results in many areas, including object
recognition [3], [13], natural language processing [5], [8] and speech
recognition [6], which demonstrates that discovering and
extracting features automatically can usually achieve better
results on representation learning [22], [21], [30]. Inspired by
the success of unsupervised feature learning, in this article
we present an end-to-end framework with a deep learning
method that can learn features automatically from the laser
range data.

We also exploit the local consistency of classes with the
assumption that samples located in the same small region are
more likely to have the same labels. Previous research has
used this characteristic to improve performance, and many
studies were carried out with consideration
of the local consistency [20], [18], [15], [14], [19].

In this paper, we consider the local consistency during the
feature learning process, where the features learn to keep
the local invariance through a graph regularization. There are
some similar works on implementing graph regularized
deep learning models [9], [32]. Both [9] and [32] utilized a
margin-based loss function proposed by Hadsell et al. [10].
These works have demonstrated the effectiveness of
graph embedding in dimensionality reduction and image
classification.

III. Multi-layer Construction and Decision 
Making 

In this paper, we assume a laser range finder with a
typical field of view of 180°. This is a limited field of view
which can give rise to many classification inaccuracies due
to a lack of crucial information. However, the full field of
view may also lead to misclassifications at the boundaries
between two different classes of places. Therefore, considering
these problems, we propose to construct multi-layer inputs
for classification, followed by fusion of the results.


A. Construction of Multi-layer Inputs 


1) Data Representation on GVG: In this paper, our
multi-layer inputs are represented by the hierarchical generalized
Voronoi graph (GVG) [4], a topological graph which has
been successfully applied to navigation, localization and
mapping. The general representation of a GVG is composed
of meet-points (locations equidistant to three or more
obstacles) and edges (feasible paths between meet-points
which are equidistant to two obstacles) [28]. We adopt
the same resolution as in our previous work [23] to construct
the first layer GVG, and then higher layers of GVGs are
constructed to describe the environment at different levels of
granularity.

Let us denote the hierarchical GVGs as $(G^{(1)}, G^{(2)}, \dots, G^{(L)})$
with $G^{(l)} = \{V^{(l)}, E^{(l)}\}$, where $L$ denotes the number of
layers, $V^{(l)}$ denotes the nodes in layer $l$ and $E^{(l)}$ denotes the
edges in layer $l$. For each layer, the independent sensing information
is carried by the nodes in $V^{(l)}$, and the local connectivity
is represented by the edges in $E^{(l)}$. More specifically, each
node $v_i^{(l)} \in V^{(l)}$ corresponds to a sequence of range data
$r_i^{(l)}$, assigned with the label $y_i^{(l)}$ for the training maps, while
$e_{ij}^{(l)} \in E^{(l)}$ reveals the connection between nodes $v_i^{(l)}$ and
$v_j^{(l)}$ with distance $d_{ij}^{(l)}$.

The first layer $G^{(1)} = \{V^{(1)}, E^{(1)}\}$ describes the
environment at the most detailed level of granularity with the originally
acquired laser range data. As the laser range finder has a
180° field of view with 1° angular resolution, each node
$v_i^{(1)} \in V^{(1)}$ corresponds to a sequence of range data $r_i^{(1)}$
with 180 dimensions.

2) Recursive Higher Layer Construction Algorithm: The
construction of a higher layer GVG is implemented by
fusing the information carried by connected nodes and then
eliminating the marginal nodes. Algorithm 1 demonstrates
the process of building a higher layer GVG from a given lower
layer. We make some definitions here for better explanation
of the algorithm. $N(v_i)$ is defined as the set of directly connected
neighbours of $v_i$, so $v_j \in N(v_i)$ means there is an
edge $e_{ij} \in E$ between $v_i$ and $v_j$. In addition, $numel(N)$
is defined as the number of elements contained in $N$. Then
$numel(N(v_i)) = 1$ means $v_i$ is an "end-node", i.e. a node
without children. Further define $M(v_i)$ as the set of
end-nodes connected to $v_i$, which obviously satisfies $M(v_i) \subset N(v_i)$.
As seen from Algorithm 1, the construction process fuses
the information carried by $v_i$'s neighbors if $v_i$ is not an
end-node (the detailed fusion process is given in Section III-A.3);
otherwise $v_i$ is eliminated from the higher layer.

The $L$ layers of data can be generated by recursively
applying Algorithm 1 $L - 1$ times, i.e. by taking
the output of the $l$-th layer as the input of the $(l+1)$-th layer. This
process is illustrated in Figure 2 with $L = 3$. In this
example, the end-nodes are denoted in red. It is to be noted
that when moving to higher layers, the number of nodes in
each layer decreases with the elimination of the end-nodes.
More details are given in the caption of Figure 2.

Algorithm 1: Generate a higher layer of input from the
previous layer.

Input: $G^{(l)} = \{V^{(l)}, E^{(l)}\}$, the range data $r_i^{(l)}$ on each
       node $v_i^{(l)}$
Output: $G^{(l+1)} = \{V^{(l+1)}, E^{(l+1)}\}$, the range data
        $r_i^{(l+1)}$ on each node $v_i^{(l+1)}$

  for $v_i^{(l)} \in V^{(l)}$ do
      if $numel(N(v_i^{(l)})) > 1$ then
          Preserve $v_i^{(l)}$, i.e. $v_i^{(l+1)} = v_i^{(l)}$;
          Construct $r_i^{(l+1)}$ and $\tilde{r}_i^{(l+1)}$ from $r_i^{(l)}$ and all of
          the $r_j^{(l)}$ carried by $v_j^{(l)} \in N(v_i^{(l)})$;
      end
      for $v_j^{(l)} \in N(v_i^{(l)})$ do
          if $v_j^{(l)} \in M(v_i^{(l)})$ then
              Eliminate $e_{ij}^{(l)}$ and $v_j^{(l)}$;
          else
              Preserve $e_{ij}^{(l)}$, i.e. $e_{ij}^{(l+1)} = e_{ij}^{(l)}$;
          end
      end
  end
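The graph part of Algorithm 1 can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions: the graph is stored as a dictionary of neighbour sets, the function name `next_layer` is hypothetical, and the range-data fusion is reduced to recording which lower-layer nodes contribute to each preserved node.

```python
from typing import Dict, Set, Tuple

Graph = Dict[int, Set[int]]  # node id -> set of directly connected neighbours N(v)


def next_layer(adj: Graph) -> Tuple[Graph, Dict[int, Set[int]]]:
    """Build layer l+1 from layer l, following the structure of Algorithm 1.

    Returns the pruned adjacency and, for each preserved node, the set of
    layer-l nodes (itself plus its neighbours) whose scans would be fused.
    """
    # A node is preserved only if it has more than one neighbour.
    end_nodes = {v for v, nbrs in adj.items() if len(nbrs) <= 1}
    kept = set(adj) - end_nodes
    # Edges to eliminated end-nodes disappear; all other edges are preserved.
    new_adj = {v: adj[v] & kept for v in kept}
    # The fused data of a preserved node draws on the node and all of N(v).
    fused_from = {v: {v} | adj[v] for v in kept}
    return new_adj, fused_from


# Small hypothetical graph: a chain 1-2-3 with two extra end-nodes on node 3.
layer1 = {1: {2}, 2: {1, 3}, 3: {2, 4, 5}, 4: {3}, 5: {3}}
layer2, fused = next_layer(layer1)
```

Applying `next_layer` repeatedly to its own output gives the higher layers, which mirrors the recursive use of the algorithm described above.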

An illustration of the different layers $G^{(l)} = \{V^{(l)}, E^{(l)}\}$,
$l = 1, 2, 3$, constructed from a specific map is given in
Figure 3. In the first layer, the nodes are distributed more
densely in the map. When approaching higher layers, the tree
structure represents more and more abstract information. It is
to be noted that the number of end-nodes (denoted as red
asterisks) decreases as the layers progress, which is
a consideration for choosing $L = 3$ in our experiments.

3) Data generation: This section describes the details
of the construction of the higher-layer range data $r_i^{(l+1)}$
and $\tilde{r}_i^{(l+1)}$, where the latter is generated from the former with
fixed length. As stated in Algorithm 1, given $v_i^{(l)}$ satisfying
$numel(N(v_i^{(l)})) > 1$ (i.e. $v_i^{(l)}$ is not an end-node), the range data
received at the respective nodes are integrated to achieve a
better perception.

Given each $v_i^{(l)}$ with $numel(N(v_i^{(l)})) > 1$, firstly a
local map is generated using occupancy grid mapping [29]
based on the respective range data in the $l$-th layer, including
the $r_j^{(l)}$ carried by $v_j^{(l)} \in N(v_i^{(l)})$ and $r_i^{(l)}$. This is achieved
by transforming all $r_j^{(l)}$ to $r_i^{(l)}$'s coordinate frame, which
assumes knowledge of the global robot poses at all times.
In this local map, a virtual scan $r_i^{(l+1)}$ is then generated
by applying ray casting at position $v_i^{(l)}$ with 1° angular
resolution, which is the same as the setting of the real laser
range finder.
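To make the fusion step concrete, the sketch below builds a virtual 360° scan from several scans with known poses. It is deliberately simplified and is not the procedure used in the paper: instead of occupancy grid mapping followed by ray casting, it transforms the scan end points into the frame of the centre node and keeps the nearest return in each 1° bearing bin. The function names, the `(x, y, theta)` pose convention and the beam layout (beams centred on the heading) are our assumptions.

```python
import numpy as np


def scan_to_points(ranges, pose):
    """Convert one scan with 1 degree resolution into world-frame end points."""
    x, y, theta = pose
    n = len(ranges)
    # Assumed beam layout: n beams, 1 degree apart, centred on the heading.
    bearings = theta + np.deg2rad(np.arange(n) - n / 2.0 + 0.5)
    return np.column_stack((x + ranges * np.cos(bearings),
                            y + ranges * np.sin(bearings)))


def virtual_scan(scans, poses, centre_pose, n_bins=360):
    """Fuse scans into one n_bins-beam scan seen from centre_pose.

    Bins that receive no return stay NaN (they are not measured).
    """
    cx, cy, ctheta = centre_pose
    fused = np.full(n_bins, np.nan)
    for ranges, pose in zip(scans, poses):
        pts = scan_to_points(np.asarray(ranges, dtype=float), pose)
        dx, dy = pts[:, 0] - cx, pts[:, 1] - cy
        dist = np.hypot(dx, dy)
        ang = np.rad2deg(np.arctan2(dy, dx) - ctheta) % 360.0
        bins = np.floor(ang * n_bins / 360.0).astype(int) % n_bins
        for b, d in zip(bins, dist):
            # Keep the closest return per bearing (stand-in for ray casting).
            if np.isnan(fused[b]) or d < fused[b]:
                fused[b] = d
    return fused


# Two hypothetical 180-beam scans taken back to back at the same position.
r = np.full(180, 2.0)
fused = virtual_scan([r, r], [(0.0, 0.0, 0.0), (0.0, 0.0, np.pi)], (0.0, 0.0, 0.0))
```

A nearest-return rule ignores free-space evidence, so a real occupancy grid would handle occlusions and noisy returns more carefully than this sketch does.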

As the dimensions of the fused range data $r_i^{(l+1)}$ could be
different at various nodes, linear interpolation on the data is
then performed to keep the same dimension of data throughout
the process. This leads to a sequence $\tilde{r}_i^{(l+1)}$ with a fixed
dimension of 360.

Fig. 2. An example of multi-layer GVG: The end-nodes are denoted in
red. The nodes of layer 1 are fused with their respective neighbours,
so that each node of layer 2 is composed of the corresponding layer-1
node and its neighbours. Then all the red nodes are eliminated from layer 1.
This process is performed recursively on layer 2 to generate layer 3.

Algorithm 2: Decision making on the confidence trees.

Input: Confidence trees where each node $v_i^{(l)}$ denotes
       the predicted label $y_i^{(l)}$ and the corresponding
       confidence $c_i^{(l)}$
Output: Optimized labels of leaf nodes $\hat{y}_i^{(1)}$.

  Initialize $\hat{c}_i^{(1)} = c_i^{(1)}$ and $\hat{y}_i^{(1)} = y_i^{(1)}$;
  for $l = 2 : L$ do
      for $v_i^{(l)} \in V^{(l)}$ do
          Average the optimized confidences $\hat{c}_j^{(l-1)}$ of $v_i^{(l)}$'s
          children;
          if this average is larger than $c_i^{(l)}$ then
              Denote $\hat{c}_i^{(l)}$ as this average;
          else
              Denote $\hat{c}_i^{(l)} = c_i^{(l)}$;
              All descendants of $v_i^{(l)}$ are assigned with the
              label $y_i^{(l)}$.
          end
      end
  end

Acknowledging the fact that the interpolated points may
not carry much information, a completeness rate, which is
the proportion of the laser measured data (dimension of $r_i^{(l)}$)
to the whole 360° data (dimension of $\tilde{r}_i^{(l)}$), is defined as:

$$q_i^{(l)} = \frac{\mathrm{length}(r_i^{(l)})}{\mathrm{length}(\tilde{r}_i^{(l)})} \tag{1}$$

where $l = 2, \dots, L$. This measure is used in the decision
making process which is discussed in the next section; thus
we set $q_i^{(1)} = 180/360 = 0.5$ for uniformity when
$l = 1$. However, we do not apply linear interpolation to
layer 1, since the initial laser range data $r_i^{(1)}$ always has
the same dimension of 180 and does not need linear
interpolation. By applying this data pre-processing approach,
the laser range data in layer 2 to layer $L$ are kept at the fixed
length of 360. Note that it is always $r_i^{(l)}$ which is employed
to construct the next layer, rather than the pre-processed $\tilde{r}_i^{(l)}$.
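The interpolation and the completeness rate can be illustrated with NumPy. In this sketch, unmeasured beams of a 360-beam virtual scan are marked as NaN and filled by linear interpolation between their measured neighbours; treating the scan as circular (so the last beam neighbours the first) is our assumption, and `interpolate_scan` is a hypothetical helper name.

```python
import numpy as np


def interpolate_scan(fused):
    """Fill NaN beams by circular linear interpolation.

    Returns the fixed-length scan and the completeness rate q, i.e. the
    fraction of beams that were actually measured.
    """
    fused = np.asarray(fused, dtype=float)
    n = fused.size
    measured = ~np.isnan(fused)
    q = measured.sum() / n
    idx = np.arange(n)
    # period=n makes np.interp wrap around between the last and first beam.
    filled = np.interp(idx, idx[measured], fused[measured], period=n)
    return filled, q


# Hypothetical scan with 332 measured beams out of 360.
scan = np.full(360, 3.0)
scan[100:128] = np.nan
filled, q = interpolate_scan(scan)
print(round(q, 4))  # 0.9222
```

With 332 measured beams the rate is 332/360, the same value quoted in the caption of Figure 4.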

As an example, Figure 4 illustrates the construction of a
sequence of input in layer 2 using the corresponding inputs
in layer 1, followed by the result after linear interpolation.
The details are given in the caption of Figure 4.


B. Decision on Multi-layer Results 

1) Construction of the Confidence Tree: With the $L$ layers
of inputs, we can obtain the predicted labels from $L$
independent classifiers, which can be arranged into confidence
trees with $L$ layers as shown in Figure 5(a), where each node
denotes the predicted label $y_i^{(l)}$ of $v_i^{(l)}$ and its corresponding
confidence $c_i^{(l)}$. By maximizing the overall confidence of
each tree structure, we aim to obtain higher accuracy
in classification.

All of these tree structures are built from the dependencies
in Algorithm 1 except for a minor difference: during
the construction of these tree structures, a parent node $v_i^{(l+1)}$
owns as its children $v_i^{(l)}$ and $v_j^{(l)} \in M(v_i^{(l)})$, while the range
data of $v_i^{(l+1)}$ is constructed from the range data carried by
$v_i^{(l)}$ and $v_j^{(l)} \in N(v_i^{(l)})$. The reason is that those nodes
with $v_j^{(l)} \in N(v_i^{(l)})$ and $v_j^{(l)} \notin M(v_i^{(l)})$ are also preserved
in the higher layer as $v_j^{(l+1)}$ and have their own predicted
labels, so we do not consider the influence of $v_i^{(l+1)}$ on them.
It is to be noted that the number of such tree structures is
equal to the number of nodes left in layer $L$, where the
$v_i^{(L)}$ are the root nodes of these trees.

In our framework, two factors are considered when
computing the confidence $c_i^{(l)}$: one is the probability $p_i^{(l)}$ obtained
from the classifier for the label $y_i^{(l)}$, and the other is the
completeness rate $q_i^{(l)}$ obtained from the input sequence $r_i^{(l)}$,
which is given in (1). Then the confidence $c_i^{(l)}$ is constructed
as:

$$c_i^{(l)} = p_i^{(l)} \times q_i^{(l)} \tag{2}$$


2) Decision Algorithm: With the confidence trees denoting
the predicted label $y_i^{(l)}$ and its corresponding confidence
$c_i^{(l)}$ for each given $v_i^{(l)}$, the aim of decision making is then to
search for the optimized confidence $\hat{c}_i^{(l)}$ and assign the optimized
label $\hat{y}_i^{(l)}$ to each node, leading to the maximum value of the
overall confidence.

In each tree structure, we make decisions from children to
parents while comparing two consecutive layers based on the
decision Algorithm 2. It is to be noted that for the comparison
between layer $l$ and layer $l - 1$, the confidence of the parent
is always compared to the average optimized confidence
of its children, and we assume the optimized confidences
in layer 1 are known as the original confidences.
As for the optimized predicted labels, Algorithm 2 states that
they are only changed to follow their ancestor when this
ancestor beats its children in confidence. In other words, if
no ancestor of a leaf node gains an advantage in confidence,
then this leaf node keeps its initial label as its
optimized label $\hat{y}_i^{(1)}$. Note that although we can obtain the
optimized labels for all nodes from this decision algorithm,
only the labels of the leaf nodes are exported as output, since
the classification performance is evaluated based on these
leaf nodes. An example is given in Figure 5(b) for better
clarity.

[Figure 3 panels: (a) Layer 1, (c) Layer 3]

Fig. 3. Multi-layer GVG graphs $G^{(l)} = \{V^{(l)}, E^{(l)}\}$, $l = 1, 2, 3$, on Fr79. The red nodes correspond to the end-nodes, which will be eliminated
in the next layer, and the black nodes will be preserved. The edges reveal the connections between these nodes.
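The decision rule can be sketched as a bottom-up pass over one confidence tree. The code below is an illustrative reading of Algorithm 2 rather than the authors' implementation: the `Node` class and function names are hypothetical, each node's `conf` is assumed to already hold the product of classifier probability and completeness rate, and the leaf confidences of the left subtree as well as the 0.3 on the second layer-2 node are made-up values chosen to be consistent with the example in Figure 5(b).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                     # predicted label of this node
    conf: float                    # confidence c = p * q of this node
    children: List["Node"] = field(default_factory=list)
    opt_label: str = ""            # optimized label after the decision pass
    opt_conf: float = 0.0          # optimized confidence


def assign_descendants(node: Node, label: str) -> None:
    for child in node.children:
        child.opt_label = label
        assign_descendants(child, label)


def decide(node: Node) -> float:
    """Bottom-up pass; returns the optimized confidence of `node`."""
    node.opt_label = node.label
    if not node.children:                      # leaf: keep original values
        node.opt_conf = node.conf
        return node.opt_conf
    mean_child = sum(decide(c) for c in node.children) / len(node.children)
    if mean_child > node.conf:                 # children win: labels unchanged
        node.opt_conf = mean_child
    else:                                      # parent wins: overwrite below
        node.opt_conf = node.conf
        assign_descendants(node, node.label)
    return node.opt_conf


def leaf_labels(node: Node) -> List[str]:
    if not node.children:
        return [node.opt_label]
    return [lab for c in node.children for lab in leaf_labels(c)]


left = Node("corridor", 0.8, [Node("office", 0.7), Node("corridor", 0.6)])
right = Node("office", 0.3, [Node("corridor", 0.5), Node("office", 0.4),
                             Node("office", 0.4)])
root = Node("office", 0.6, [left, right])
print(round(decide(root), 4))  # 0.6167
print(leaf_labels(root))
# ['corridor', 'corridor', 'corridor', 'office', 'office']
```

In this run the left layer-2 node overrides its two leaves, the right one does not, and the root loses against the averaged optimized confidence of layer 2, so only the first leaf changes its label.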


We can also evaluate the results obtained from those $L$
independent classifiers separately with the help of these
constructed trees. To ensure fairness, the results obtained
from the classifiers of the different layers are all compared on the
accuracy at the bottom layer. Obviously, the results obtained
from the input of layer 1 do not need to be modified, while
the higher layers should spread their predicted labels to the
bottom layer. Given a specific layer $l$ ($l > 1$), all of the nodes
on the bottom layer are assigned the same label as their
ancestor in layer $l$. For example, as shown in Figure 5(b),
the leaf nodes of a tree are labeled with the predicted
label of its layer-3 root node when we evaluate the results of layer 3.

IV. Semi-supervised Learning and Classification 

We have introduced the construction of multi-layer inputs
and decision making on the multi-layer results in Section
III. In this section, we discuss the classification problem of
how to train on each layer with the input data and obtain the
predicted labels of the testing maps. This is implemented by
a deep learning structure with the capability to automatically
learn features from the raw input data. The $L$ layers of inputs
are trained through $L$ independent deep learning models as
indicated in Figure 1; these models have the same
structure, with raw laser range data being the input and
predicted labels being the output, as shown in Figure 6. Thus
the discussion below in this section is not confined to any
specific layer and hence the superscripts are omitted. It is to
be noted that our training process is semi-supervised since
both the training map and the testing map are employed for
model training, where only the labels of the training map
are available. The semi-supervised learning process has the
advantage of gaining richer information about the data distribution,
while keeping the spatial consistency, as we will introduce
in this section.

Fig. 4. An example of constructing $r_i^{(2)}$ and $\tilde{r}_i^{(2)}$, where the axes are in meters. The left four figures illustrate $r_i^{(1)}$ and all of the $r_j^{(1)}$ carried by
$v_j^{(1)} \in N(v_i^{(1)})$, where the black asterisk node denotes the position of $v_i^{(1)}$, the red asterisk nodes denote the positions of $v_j^{(1)}$ and the blue nodes denote
the range data collected from the real environment. Then the middle figure shows the constructed $r_i^{(2)}$ using ray casting. The interpolated sequence is
given on the right, where the magenta points correspond to the interpolated ones. In this example, we have $q_i^{(2)} = 332/360 = 0.9222$.

Fig. 5. Confidence trees built from Figure 2 and a corresponding example.
(a) The confidence tree: Each parent node $v_i^{(l+1)}$ has children $v_i^{(l)}$ and
$v_j^{(l)} \in M(v_i^{(l)})$. (b) The decision example: in this example, let us assume
that the confidence of each node is known. By applying the decision method
given in Algorithm 2, firstly we have the initialization $\hat{c}_i^{(1)} = c_i^{(1)}$ and
$\hat{y}_i^{(1)} = y_i^{(1)}$. Then the average confidence of the children in the bottom-most
layer is compared with that of their corresponding parents in the immediate upper
layer. In the left tree, $c_1^{(2)}$ is larger than the average confidence of its two
children, and therefore $\hat{c}_1^{(2)} = 0.8$ and both children
are assigned the label $y_1^{(2)}$. The confidence $c_2^{(2)}$ is smaller than the average
confidence of its three children, hence these leaf nodes retain their initial labels and
$\hat{c}_2^{(2)} = (0.5 + 0.4 + 0.4)/3 = 0.4333$. Finally the layer-3 confidence of 0.6 is compared
with $(\hat{c}_1^{(2)} + \hat{c}_2^{(2)})/2 = 0.6167$. Since the confidence of layer 3 is smaller
than the optimized average confidence combined from layer 1 and layer 2,
the final optimized confidence is 0.6167 and the optimized labels
do not change. By applying the same decision process to the right tree in
the figure, three of its leaf nodes are labeled the same as their ancestor.


A. Semi-supervised Learning with Graph Regularization 

In the classification problem, we denote the training pairs
as $(X_l \in \mathbb{R}^{m \times n_l}, Y_l \in \mathbb{R}^{1 \times n_l})$ as a convention, where
$m$ denotes the input dimension and $n_l$ denotes the number
of training samples. In particular, each column in $X_l$ is a
sequence of laser range data $r$, i.e. $x_i = r_i$. The testing data
can be defined in the same way as $X_u \in \mathbb{R}^{m \times n_u}$, where
$n_u$ denotes the number of testing samples. Then the task of
the classification problem is to obtain predicted labels of $X_u$
given $X_l$ and $Y_l$. In addition, we denote $X = [X_l \; X_u] \in
\mathbb{R}^{m \times n}$ as the combination of training data and testing data
with $n = n_l + n_u$, since $X$ is fed into the model as a whole
during our semi-supervised training process.

As illustrated in Figure 6, the input is firstly fed into
a layer with fixed parameters (denoted in red) that computes the
differences between the consecutive beams in each raw
scan, as the consecutive differences can also provide rich
information for place classification and are often employed
for extracting geometric features in previous works [15],
[27]. In the practical experiments, we sort both the
input and the consecutive differences to guarantee rotational
invariance.

Fig. 6. Model training in semi-supervised learning: The second layer has
fixed parameters and computes the consecutive differences of our input
(denoted in red). Then both the input and the output of the second layer
are fed into the subsequent process. For the fine-tuning process, $J_{label}$
is imposed on the softmax classifier and all of the parameters in the neural
network (except the fixed layer) are adjusted, while $J_{graph}$ is imposed
on the last hidden layer and only influences the feature learning process.
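As a rough illustration of this fixed pre-processing layer, the sketch below computes the consecutive differences of each scan, sorts both the ranges and the differences, and stacks them. How the boundary beam is handled is not stated in the text; the wrap-around difference used here is our assumption, chosen so that the difference vector keeps the input dimension $m$, and `build_input` is a hypothetical name.

```python
import numpy as np


def build_input(X):
    """X: (m, n) matrix with one scan per column. Returns a (2m, n) matrix."""
    X = np.asarray(X, dtype=float)
    # Difference between consecutive beams, wrapping from the last beam to
    # the first so the result has m rows (assumption, see lead-in).
    D = np.roll(X, -1, axis=0) - X
    # Sorting removes the dependence on where the scan starts.
    return np.vstack((np.sort(X, axis=0), np.sort(D, axis=0)))


rng = np.random.default_rng(0)
X = rng.uniform(0.5, 30.0, size=(6, 2))       # two toy scans with 6 beams
same = np.allclose(build_input(X), build_input(np.roll(X, 2, axis=0)))
print(same)  # True
```

The check at the end confirms the intended property: circularly shifting the beams of a scan, which corresponds to rotating the sensor for a full circular scan, leaves the sorted input unchanged.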

From this point on, both the input and the output of this
fixed layer are fed into the stacked auto-encoders for feature
learning. The auto-encoder is a widely used structure
for building deep architectures, and is composed of an
encoder and a decoder. By feeding the representation learned
by the previous encoder as the input into another
auto-encoder, we can obtain the stacked hidden representations
as shown in Figure 6. Let us denote the sigmoid function as
$f(x) = 1/(1 + e^{-x})$; then the $i$-th layer of encoder and
decoder can be represented as follows:

$$H_i = f(W_i H_{i-1} + b_i), \qquad \hat{H}_{i-1} = f(W_i^T H_i + c_i) \tag{3}$$

where $H_{i-1}$ and $\hat{H}_{i-1}$ denote the input$^1$ and its reconstruction,
$H_i$ denotes the hidden representation and $W_i$, $b_i$, $c_i$
denote the weight parameters, respectively. In this paper,
the weights in each pair of encoder and decoder are tied
together as shown in (3).

For each layer of auto-encoder, unsupervised pre-training
is applied to obtain better parameters than random
initialization [12] by minimizing the reconstruction cost:

$$J_{pre} = \frac{1}{m} \| \hat{H}_{i-1} - H_{i-1} \|_F^2 \tag{4}$$

Note that the decoder is discarded after pre-training while 
the encoder is preserved. The hidden representation learned 
by the last auto-encoder can be regarded as the feature for 
the input to the classifier. 

$^1$When $i = 1$, $H_{i-1}$ is the raw input, i.e. the combination of $X$ and its
consecutive differences $X_s$.
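One tied-weight auto-encoder layer and its reconstruction cost can be written compactly in NumPy. This is only a sketch of the forward computation; the function names are hypothetical, no training loop is shown, and the normalising constant of the cost (here the first dimension of the input) is our assumption.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def autoencoder_forward(H_prev, W, b, c):
    """One tied-weight auto-encoder layer.

    H_prev: (d_in, n) input, W: (d_hidden, d_in) shared weights,
    b: (d_hidden, 1) encoder bias, c: (d_in, 1) decoder bias.
    """
    H = sigmoid(W @ H_prev + b)        # encoder
    H_rec = sigmoid(W.T @ H + c)       # decoder reuses W (tied weights)
    return H, H_rec


def reconstruction_cost(H_prev, H_rec):
    # Squared Frobenius norm of the reconstruction error; dividing by the
    # input dimension is an assumed normalisation.
    return float(np.sum((H_rec - H_prev) ** 2) / H_prev.shape[0])


rng = np.random.default_rng(0)
H0 = rng.random((360, 8))                       # 8 toy inputs of dimension 360
W = 0.01 * rng.standard_normal((100, 360))      # 100 hidden units
H1, H0_rec = autoencoder_forward(H0, W, np.zeros((100, 1)), np.zeros((360, 1)))
cost = reconstruction_cost(H0, H0_rec)
```

Stacking then amounts to calling `autoencoder_forward` again with `H1` as input and a new weight matrix, after which only the encoder half of each layer is kept.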


In the work reported here, the softmax classifier is
applied to the features learned by the stacked auto-encoders for
classification, which is formulated as follows:

$$p_i = \frac{\exp(w_i^T h)}{\sum_j \exp(w_j^T h)} \tag{5}$$

where $p_i$ corresponds to the probability that the hidden
representation vector $h$ belongs to the $i$-th class.
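For reference, the softmax probabilities can be computed as below. The sketch assumes the class weight vectors $w_i$ are stored as the rows of a matrix `W`; subtracting the maximum score before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged.

```python
import numpy as np


def softmax_probs(W, h):
    """W: (k, d) matrix whose rows are the class weights w_i; h: (d,) feature.

    Returns the (k,) vector of class probabilities p_i.
    """
    z = W @ h
    z = z - z.max()          # stabilises exp() without changing the result
    e = np.exp(z)
    return e / e.sum()


rng = np.random.default_rng(0)
W = rng.standard_normal((3, 24))   # 3 classes, 24-dimensional features
h = rng.random(24)
p = softmax_probs(W, h)
predicted_class = int(np.argmax(p))
```

The largest entry of `p` is also the classifier probability that enters the confidence of a node in the decision step.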

After pre-training and classification, back propagation can
be used to fine-tune the whole learning process for further
improvement, which means the parameters of the preserved
encoders and the softmax classifier are trained together. In order to keep the
local consistency, we add a graph regularization term on the
learned representation during fine-tuning. The cost function of
the fine-tuning is given as follows:

$$J_{fine} = J_{label} + J_{graph} = \sum_{i=1}^{n_l} J_{label}(x_i, y_i) + \sum_{i=1}^{n} \sum_{j=1}^{n} s_{ij} \| h_i - h_j \|^2 \tag{6}$$

where the first term corresponds to the prediction error on the
training data, and the second term is the graph regularization.
Here $h_i$ and $h_j$ are the outputs of the last hidden layer with
respect to the inputs $x_i$ and $x_j$ ($x_i$ and $x_j$ are two arbitrary
columns in $X$), and $s_{ij}$ is the similarity measurement
between the samples $x_i$ and $x_j$ that are connected in the GVG, which
is an element of the adjacency graph $S = [s_{ij}]_{n \times n}$. Figure 6
also illustrates the way our cost function works. The cost
caused by the prediction error is imposed on the softmax
classifier, and our graph regularization is imposed on
the last hidden layer. So during the fine-tuning, $J_{label}$
will influence all of the parameters, while $J_{graph}$ will only
influence the parameters for feature learning.
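The graph regularization term is a weighted sum of squared feature distances, which can be evaluated without explicit loops. The sketch below computes it in two ways: directly from pairwise distances, and through the graph Laplacian $D - S$ using the standard identity $\sum_{ij} s_{ij}\|h_i - h_j\|^2 = 2\,\mathrm{tr}(H (D - S) H^T)$, which holds for symmetric $S$. Function names are hypothetical and any constant scaling of the term is left out.

```python
import numpy as np


def graph_cost(H, S):
    """H: (d, n) last-hidden-layer features; S: (n, n) adjacency weights."""
    sq = np.sum(H ** 2, axis=0)
    # Pairwise squared Euclidean distances between the columns of H.
    D2 = sq[:, None] + sq[None, :] - 2.0 * (H.T @ H)
    return float(np.sum(S * D2))


def graph_cost_laplacian(H, S):
    """Same quantity via the graph Laplacian; requires a symmetric S."""
    L = np.diag(S.sum(axis=1)) - S
    return float(2.0 * np.trace(H @ L @ H.T))


rng = np.random.default_rng(0)
H = rng.standard_normal((24, 5))          # 5 toy samples, 24-d features
A = rng.random((5, 5))
S = (A + A.T) / 2.0                       # symmetric toy weights
np.fill_diagonal(S, 0.0)
print(np.isclose(graph_cost(H, S), graph_cost_laplacian(H, S)))  # True
```

Because the GVG adjacency is sparse, a practical implementation would likely sum over the edges only instead of forming the dense `n x n` distance matrix.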


B. Graph Regularization in Place Classification Problem 

As shown in (6), the learned features $h_i$ and $h_j$ with a large
weight $s_{ij}$ are pushed together by the graph regularization
term. In this section, we describe the details of the
construction of the adjacency graph $S$, which can be built
in two steps. Firstly we define the connected relationships
between samples, and then we calculate the weights of the
connected edges.

In the place classification problem, the connected
relationships in the topological graph GVG are directly used in
the adjacency graph. Then the samples with close coordinates
are forced to be represented by features with close
distances. As for the weight, which corresponds to the
strength of the graph regularization, it is inversely associated
with two distances, i.e. the distance between coordinates and
the distance between the input data, which can be formulated
as:

$$s_{ij} = \frac{\alpha}{d_{ij}} + \frac{\beta}{\| x_i - x_j \|^2} \tag{7}$$

where $\alpha$ and $\beta$ are constant weights and $d_{ij}$ denotes the
Euclidean distance between the sample coordinates. The second
term involves the Euclidean distance between the input data.


















This weighting scheme does not only evaluate the
geometrical information, but also considers the closeness between
inputs. For example, given an edge that connects two nodes
belonging to corridor and office respectively, although $d_{ij}$ is
small, $\| x_i - x_j \|^2$ can be large. Therefore, these two nodes
are not forced to be too close in the representation space,
which preserves the discriminative information.
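The two-step construction of the adjacency graph can be sketched as follows, reading the weight as $\alpha / d_{ij} + \beta / \|x_i - x_j\|^2$ on GVG edges and zero elsewhere. The function name, the default values of `alpha` and `beta`, and the small `eps` that guards against division by zero are our assumptions, not values from the paper.

```python
import numpy as np


def adjacency_weights(coords, X, edges, alpha=1.0, beta=1.0, eps=1e-12):
    """Build the symmetric adjacency graph S.

    coords: (n, 2) node coordinates, X: (m, n) inputs (one scan per column),
    edges: iterable of (i, j) index pairs taken from the GVG.
    """
    coords = np.asarray(coords, dtype=float)
    X = np.asarray(X, dtype=float)
    n = coords.shape[0]
    S = np.zeros((n, n))
    for i, j in edges:
        d = np.linalg.norm(coords[i] - coords[j])        # coordinate distance
        dx2 = np.sum((X[:, i] - X[:, j]) ** 2)           # squared input distance
        S[i, j] = S[j, i] = alpha / (d + eps) + beta / (dx2 + eps)
    return S


coords = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]]
X = np.array([[0.0, 2.0, 9.0],
              [0.0, 0.0, 9.0]])
S = adjacency_weights(coords, X, edges=[(0, 1)])
print(round(S[0, 1], 4))  # 0.75
```

Nodes 0 and 1 are 2 m apart with a squared input distance of 4, giving 1/2 + 1/4; an edge between two very different scans would receive a small second term, which matches the corridor and office example above.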


V. Experiments 


To validate the effectiveness of our end-to-end multi-layer 
learning system, we conduct experiments on six data sets 
collected from six international university indoor environ¬ 
ments (including the Centre for Autonomous Systems at 
the University of Technology, Sydney, several buildings in 
the University of Freiburg, the German Research Centre for 
Artificial Intelligence in Saarbruecken, and the Intel Lab in 
Seattle). As we stated previously, the robot collected range 
data at the GVG nodes using the 2D laser range finder which 
has a maximum range of 30m and a horizontal field of view 
of 180°. 

It is to be noted that the classes defined by humans can
be somewhat vague and numerous, depending on the different
functions of places. However, the 2D range data do not
contain enough discriminative information to classify all these
human-designed classes. Therefore, after careful consideration,
we consider 3 target classes: Class 1, space designed
for a small number of individuals, including cubicle, office,
printer room, kitchen, bathroom, stairwell and elevator; Class
2, space for group activities, including meeting room and
laboratory; and Class 3, corridor.

Among these six data sets, two (Fr79 and Intellab) contain 
all three classes, while the others contain only some of 
them. We adopt leave-many-out training, in which one data 
set is used for training and the others are used for testing. 
We therefore obtain two groups of results, by training on 
Fr79 and on Intellab respectively. 
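The leave-many-out protocol can be sketched as follows. This is only an illustration: the function name `leave_many_out_splits` is hypothetical, and the data set names are taken from the result tables.

```python
def leave_many_out_splits(dataset_names, trainable):
    """Yield (train_name, test_names): train on one data set, test on the rest."""
    for train_name in trainable:
        test_names = [name for name in dataset_names if name != train_name]
        yield train_name, test_names


# Data set names as they appear in the result tables.
DATASETS = ["UTS", "SarrB", "FrUA", "FrUB", "Fr79", "Intellab"]
# Only the two data sets that contain all three classes are used for training.
TRAINABLE = ["Intellab", "Fr79"]

splits = list(leave_many_out_splits(DATASETS, TRAINABLE))
for train_name, test_names in splits:
    print(f"train on {train_name}, test on {', '.join(test_names)}")
```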

The feature learning and classification model for each 
layer of input is shown in Figure 6. Given the input 
$X \in \mathbb{R}^{m \times n}$, the dimension configuration of our learning 
model is m-m-100-24-3: the consecutive differences layer has 
the same dimension as the input, and the two hidden layers 
have 100 and 24 units respectively. The dimension of our 
learned features is therefore 24. The output of the model 
represents a probabilistic measure of the data belonging to 
each class, so the output dimension equals the number of 
classes. In addition, since we perform interpolation to fix the 
dimension of the higher layers, as introduced in Section III-A.3, 
we have m = 180 for L = 1 and m = 360 for L = 2, 3, .... In this 
paper, we choose L = 3. 
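To make the dimension configuration concrete, the short Python sketch below lists the layer sizes for each input layer. The helper names `layer_dims` and `count_dense_parameters` are hypothetical. The parameter count additionally assumes that the transitions after the consecutive differences layer are plain fully connected layers with biases, which is an assumption and not something stated in the text.

```python
def layer_dims(layer_index, n_classes=3):
    """Return the m-m-100-24-3 configuration for a given input layer."""
    if layer_index < 1:
        raise ValueError("layer_index starts at 1")
    m = 180 if layer_index == 1 else 360  # m = 180 for L = 1, m = 360 otherwise
    return [m, m, 100, 24, n_classes]


def count_dense_parameters(dims):
    """Count weights and biases of the transitions m -> 100 -> 24 -> 3.

    Assumption: these transitions are fully connected layers with biases.
    The first transition (input -> consecutive differences) is skipped.
    """
    learned = dims[1:]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(learned, learned[1:]))


for layer_index in (1, 2, 3):
    dims = layer_dims(layer_index)
    print(layer_index, "-".join(str(d) for d in dims), count_dense_parameters(dims))
```

Under these assumptions, the learned 24-dimensional feature is the activation of the second hidden layer, and only the first learned weight matrix changes size between L = 1 and the higher layers.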


TABLE I 

Multi-layer results trained on Intellab. 

Map        L1(%)    L2(%)    L3(%)
UTS        85.20    89.49    91.24
SarrB      86.55    87.64    91.32
FrUA       86.23    92.96    91.69
FrUB       90.29    98.87    99.84
Fr79       81.99    85.87    87.90
Average    86.05    90.97    92.40


TABLE II 

Multi-layer results trained on Fr79. 

Map        L1(%)    L2(%)    L3(%)
UTS        81.70    85.99    89.93
SarrB      84.16    95.44    90.46
FrUA       90.43    94.70    96.91
FrUB       88.67    98.87    99.51
Intellab   72.55    79.81    82.73
Average    83.50    90.96    91.91

A. Multi-layer Results without Graph Regularization 

We first conduct experiments to evaluate the performance 
of our multi-layer inputs. Table I and Table II show the 
leave-many-out classification results obtained by training on 
Intellab and Fr79 respectively. Note that graph regularization 
is not considered here, so $\lambda = 0$ in the cost function (6). In 
general, the results of higher layers are better than those of 
lower layers, owing to the richer information contained in 
each node on the higher layers. 

B. Multi-layer Results with Graph Regularization 

We also carried out experiments to validate the effectiveness 
of the graph regularization. The algorithm remains the 
same as in the previous setting, except that we set $\lambda = 1$ 
to add the graph regularization. In these experiments, 
we pay more attention to the geometrical neighborhood, so 
we use $\alpha = 2/3$ and $\beta = 1/3$ in the edge-weight formula for 
the construction of the adjacency graph. The classification 
results are shown in Table III and Table IV, which are trained 
on Intellab and Fr79 respectively. The results follow the same 
trend as in Table I and Table II: higher layers give rise to 
better accuracies. Further comparison of Table I with Table III 
shows that feature learning with graph regularization performs 
better than without it. This indicates that graph regularization 
improves classification performance by keeping the local 
consistency. 

TABLE III 

Multi-layer results trained on Intellab with graph regularization. 

Map        L1(%)    L2(%)    L3(%)
UTS        83.54    87.30    92.29
SarrB      89.59    96.31    90.89
FrUA       91.48    91.77    96.68
FrUB       89.97    99.19    99.84
Fr79       83.96    86.12    88.65
Average    87.71    92.14    93.67

C. Fusion Results 

Finally, we show the accuracies of the multi-layer graph 
regularized method with fusion in Table V and Table VI. 
Compared with the results of each single layer shown in 
Table III and Table IV, the fusion results achieve better 
accuracies. For the results trained on Intellab, the average 
accuracy of the fusion results rose to 94.48% from 
L1: 87.71%, L2: 92.14% and L3: 93.67%, and the results 
trained on Fr79 reached 93.59% from L1: 84.24%, 
L2: 92.17% and L3: 92.95%. The fused test results trained 
on Intellab are illustrated in Figure 7. Note that confusions 
between Class 1 (office room and other rooms) and Class 2 
(meeting room) account for the major misclassifications, 
especially in the test map of Fr79. A likely cause is that 
Class 1 is characterized by narrow environments with massive 
clutter, while Class 2 is characterized by relatively larger 
spaces; therefore the corners of meeting rooms are mostly 
classified as office room and other rooms, and some center 
positions of office rooms are assigned as meeting room. 
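The average accuracies can be recomputed from the per-map entries as a simple arithmetic check. The sketch below does this for the Intellab-trained results, using the values listed in Table III (single layers) and Table V (fusion); the helper name `mean_accuracy` is hypothetical.

```python
def mean_accuracy(accuracies):
    """Unweighted mean over test maps, rounded to two decimals."""
    return round(sum(accuracies) / len(accuracies), 2)


# Per-map accuracies (%) in the order UTS, SarrB, FrUA, FrUB, Fr79.
table_iii = {
    "L1": [83.54, 89.59, 91.48, 89.97, 83.96],
    "L2": [87.30, 96.31, 91.77, 99.19, 86.12],
    "L3": [92.29, 90.89, 96.68, 99.84, 88.65],
}
table_v_fusion = [91.24, 96.53, 95.02, 99.84, 89.76]

averages = {name: mean_accuracy(values) for name, values in table_iii.items()}
averages["fusion"] = mean_accuracy(table_v_fusion)
print(averages)
```

Each average weights every test map equally, regardless of how many nodes the map contains.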

We also compare against the results achieved in our 
previous work, SPCoGVG [23]. SPCoGVG is also a 
semi-supervised approach, composed of a support vector 
machine (SVM) and a conditional random field (CRF) to 
ensure the generalization ability. SPCoGVG uses 
24-dimensional hand-engineered features, which are extracted 
from the raw range data using geometrical knowledge. Note 
that our learned features have the same dimension as the 
hand-engineered features in our experiments. As seen from 
Table V and Table VI, we achieve slightly better average 
results than SPCoGVG. 

TABLE IV 

Multi-layer results trained on Fr79 with graph regularization. 

Map        L1(%)    L2(%)    L3(%)
UTS        80.47    89.23    90.02
SarrB      87.20    96.75    95.23
FrUA       91.06    96.12    97.47
FrUB       89.48    98.87    99.51
Intellab   73.00    79.89    82.51
Average    84.24    92.17    92.95

TABLE V 

Graph regularized fusion results trained on Intellab and results using SPCoGVG. 

Map        Multi-layer fusion(%)    SPCoGVG(%)
UTS        91.24                    90.72
SarrB      96.53                    88.72
FrUA       95.02                    96.52
FrUB       99.84                    98.71
Fr79       89.76                    92.04
Average    94.48                    93.39

TABLE VI 

Graph regularized fusion results trained on Fr79 and results using SPCoGVG. 

Map        Multi-layer fusion(%)    SPCoGVG(%)
UTS        90.54                    89.84
SarrB      98.27                    93.71
FrUA       97.23                    97.71
FrUB       99.51                    99.19
Intellab   82.40                    86.89
Average    93.59                    93.47

VI. Conclusions 

In this paper, we presented an end-to-end place classification 
framework. We implemented a multi-layer learning 
framework, including the construction of multi-layer inputs 
and decision making on the multi-layer results. Each layer 
of inputs was fed into a semi-supervised model for feature 
learning and classification, which guaranteed the local 
consistency through graph regularization. 

Experimental results showed that higher-layer input 
data led to higher classification accuracy, which validated 
the effectiveness of the multi-layer structure. By performing 
the semi-supervised learning with and without graph 
regularization, we also showed that graph regularization helps 
improve the classification performance by keeping the 
local consistency. Furthermore, the fusion results based on 
the confidence tree were comparable to those of the 
state-of-the-art method. In short, we achieved generalization 
ability and preserved local consistency in our end-to-end 
place classification framework. Future work is to apply our 
framework to other types of sensor data, such as RGB-D data, 
which have greater representative and discriminative ability. 